On Reverse Pinsker Inequalities

Igal Sason 


Abstract 

New upper bounds on the relative entropy are derived as a function of the total variation distance.
One bound refines an inequality by Verdú for general probability measures. A second bound improves
the tightness of an inequality by Csiszár and Talata for arbitrary probability measures that are defined on
a common finite set. The latter result is further extended, for probability measures on a finite set, leading
to an upper bound on the Rényi divergence of an arbitrary non-negative order (including ∞) as a function
of the total variation distance. Another lower bound by Verdú on the total variation distance, expressed in
terms of the distribution of the relative information, is tightened and it is attained under some conditions.
The effect of these improvements is exemplified.

Keywords: Pinsker’s inequality, relative entropy, relative information, Rényi divergence, total
variation distance, typical sequences.


I. Introduction 


Consider two probability measures P and Q defined on a common measurable space $(\mathcal{A}, \mathcal{F})$.
The Csiszár-Kemperman-Kullback-Pinsker inequality states that

$$ D(P\|Q) \ge \frac{\log e}{2} \, |P-Q|^2 \tag{1} $$

where

$$ D(P\|Q) = \mathbb{E}_P\left[\log \frac{dP}{dQ}\right] = \int_{\mathcal{A}} dP \, \log \frac{dP}{dQ} \tag{2} $$

designates the relative entropy from P to Q (a.k.a. the Kullback-Leibler divergence), and

$$ |P-Q| = 2 \sup_{A \in \mathcal{F}} |P(A) - Q(A)| \tag{3} $$

designates the total variation distance (or $L_1$ distance) between P and Q. One of the implications
of inequality (1) is that convergence in relative entropy implies convergence in total variation
distance. The total variation distance is bounded, $|P-Q| \le 2$, in contrast to the relative entropy.

Inequality (1) is a.k.a. Pinsker’s inequality, although the analysis made by Pinsker [15] leads
to a significantly looser bound, where the constant $\frac{\log e}{2}$ on the RHS of (1) is replaced by a smaller one (see [25,
Eq. (51)]). Improved and generalized versions of Pinsker’s inequality have been studied in Q, 

II, m, m, Id, m. 

For any $\varepsilon > 0$, there exists a pair of probability measures P and Q such that $|P-Q| \le \varepsilon$ while
$D(P\|Q) = \infty$. Consequently, a reverse Pinsker inequality which provides an upper bound on
the relative entropy in terms of the total variation distance does not exist in general. Nevertheless,
under some conditions, such inequalities hold 01, 1251 . 12^ (to be addressed later in this section). 
If $P \ll Q$, the relative information in $a \in \mathcal{A}$ according to (P, Q) is defined to be

$$ i_{P\|Q}(a) \triangleq \log \frac{dP}{dQ}(a). \tag{4} $$

From (2), the relative entropy can be expressed in terms of the relative information as follows:

$$ D(P\|Q) = \mathbb{E}\big[i_{P\|Q}(X)\big] = \mathbb{E}\big[i_{P\|Q}(Y) \exp\big(i_{P\|Q}(Y)\big)\big] \tag{5} $$

where $X \sim P$ and $Y \sim Q$ (i.e., X and Y are distributed according to P and Q, respectively).
The total variation distance is also expressible in terms of the relative information [25]. If $P \ll Q$,

$$ |P-Q| = \mathbb{E}\big[\,\big|1 - \exp\big(i_{P\|Q}(Y)\big)\big|\,\big] \tag{6} $$

and if, in addition, $Q \ll P$, then

$$ |P-Q| = \mathbb{E}\big[\,\big|1 - \exp\big(-i_{P\|Q}(X)\big)\big|\,\big]. \tag{7} $$

Let

$$ \beta_1^{-1} \triangleq \sup_{a \in \mathcal{A}} \frac{dP}{dQ}(a) \tag{8} $$

with the convention, implied by continuity, that $\beta_1 = 0$ if $i_{P\|Q}$ is unbounded from above. With
$\beta_1 \le 1$, as it is defined in (8), the following inequality holds (see [25, Theorem 7]):

$$ \frac{1}{2} \, |P-Q| \ge \left( \frac{1-\beta_1}{\log \frac{1}{\beta_1}} \right) D(P\|Q). \tag{9} $$

From (9), if the relative information is bounded from above, a reverse Pinsker inequality holds.
This inequality has been recently used in the context of the optimal quantization of probability 
measures when the distortion is either characterized by the total variation distance or the relative 
entropy between the approximating and the original probability measures HJ Proposition 4]. 

Inequality (9) is refined in this work, and the improvement that is obtained by this refinement
is exemplified (see Section II).

In the special case where P and Q are defined on a common discrete set (i.e., a finite or
countable set) $\mathcal{A}$, the relative entropy and total variation distance are simplified to

$$ D(P\|Q) = \sum_{a \in \mathcal{A}} P(a) \log \frac{P(a)}{Q(a)}, \qquad |P-Q| = \sum_{a \in \mathcal{A}} |P(a) - Q(a)| \triangleq |P-Q|_1. $$
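The two discrete formulas above are easy to evaluate numerically. The following is a minimal sketch, not part of the original paper; the function names `kl_divergence` and `total_variation` are illustrative, natural logarithms are assumed (so $\log e = 1$), and distributions are passed as plain lists of probabilities. It also checks Pinsker's inequality (1) on one example.

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(P||Q) in nats for distributions on a finite set."""
    total = 0.0
    for pa, qa in zip(p, q):
        if pa > 0:
            if qa == 0:
                return math.inf  # P is not absolutely continuous w.r.t. Q
            total += pa * math.log(pa / qa)
    return total

def total_variation(p, q):
    """|P - Q| = sum_a |P(a) - Q(a)|, which takes values in [0, 2]."""
    return sum(abs(pa - qa) for pa, qa in zip(p, q))

# Illustrative distributions on a 3-element set.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
d = kl_divergence(p, q)
tv = total_variation(p, q)
# Pinsker's inequality (1) in nats: D(P||Q) >= |P - Q|^2 / 2.
print(d, tv, 0.5 * tv ** 2)
```

The infinite value returned for disjoint supports illustrates why a reverse inequality cannot hold without further conditions.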


A restriction to probability measures on a finite set $\mathcal{A}$ has led in [4, p. 1012 and Lemma 6.3]
to the following upper bound on the relative entropy in terms of the total variation distance:

$$ D(P\|Q) \le \left( \frac{\log e}{Q_{\min}} \right) |P-Q|^2, \tag{10} $$

where $Q_{\min} \triangleq \min_{a \in \mathcal{A}} Q(a)$, suggesting a kind of a reverse Pinsker inequality for probability
measures on a finite set. A recent application of this bound has been exemplified in [13,
Appendix D] and [23, Lemma 7] for the analysis of the third-order asymptotics of the discrete
memoryless channel with or without cost constraints.

The present paper also considers generalized reverse Pinsker inequalities for Rényi divergences.
In the discrete setting, the Rényi divergence of order $\alpha$ from P to Q is defined as

$$ D_\alpha(P\|Q) \triangleq \frac{1}{\alpha-1} \, \log \left( \sum_{a \in \mathcal{A}} P^\alpha(a) \, Q^{1-\alpha}(a) \right), \quad \forall \alpha \in (0,1) \cup (1,\infty). \tag{11} $$

Recall that $D_1(P\|Q) \triangleq D(P\|Q)$ is defined to be the analytic extension of $D_\alpha(P\|Q)$ at $\alpha = 1$
(if $D(P\|Q) < \infty$, it can be verified with L’Hôpital’s rule that $D(P\|Q) = \lim_{\alpha \to 1^-} D_\alpha(P\|Q)$).
The extreme cases of $\alpha = 0, \infty$ are defined as follows:

• If $\alpha = 0$ then $D_0(P\|Q) = -\log Q(\text{Support}(P))$ where $\text{Support}(P) = \{x \in \mathcal{X}: P(x) > 0\}$
denotes the support of P,

• If $\alpha = +\infty$ then $D_\infty(P\|Q) = \log\left(\text{ess sup} \, \frac{P}{Q}\right)$ where ess sup f denotes the essential
supremum of a function f.

Pinsker’s inequality has been generalized by Gilardoni [9] for Rényi divergences of order
$\alpha \in (0,1]$ (see also [6, Theorem 30]), and it gets the form

$$ D_\alpha(P\|Q) \ge \frac{\alpha \log e}{2} \, |P-Q|^2. $$

An improved bound, providing the best lower bound on the Rényi divergence of order $\alpha > 0$
in terms of the total variation distance, has been recently introduced in [20, Section 2].

Motivated by these findings, the analysis in this paper suggests an improvement over the upper
bound on the relative entropy in (10) for probability measures defined on a common finite set.
The improved version of the bound in (10) is further generalized to provide an upper bound on
the Rényi divergence of orders $\alpha \in [0, \infty]$ in terms of the total variation distance.

Note that the issue addressed in this paper of deriving, under suitable conditions, upper bounds
on the relative entropy as a function of the total variation distance has some similarity to the
issue of deriving upper bounds on the difference between entropies as a function of the total
variation distance. Note also that in the special case where Q is a Gaussian distribution and P is
a probability distribution with the same covariance matrix, then $D(P\|Q) = h(Q) - h(P)$ where
$h(\cdot)$ denotes the differential entropy of a specified distribution (see [3, Eq. (8.76)]). Bounds on
the entropy difference in terms of the total variation distance have been studied, e.g., in [3,
Theorem 17.3.3], El, HU, El, (H, l26l Section 1.7], gTl. 

This paper is structured as follows: Section II refers to [25], deriving a refined version of
inequality (9) for general probability measures, and improving another lower bound on the total
variation distance which is expressed in terms of the distribution of the relative information.
Section III derives a reverse Pinsker inequality for probability measures on a finite set, improving
inequality (10) that follows from [4, Lemma 6.3]. Section IV extends the analysis to Rényi
divergences of arbitrary non-negative orders. Section V exemplifies the utility of a reverse Pinsker
inequality in the context of typical sequences.

II. A Refined Reverse Pinsker Inequality for General Probability Measures 

The present section derives a reverse Pinsker inequality for general probability measures,
suggesting a refined version of [25, Theorem 7]. The utility of this new inequality is exemplified.
This section also provides a lower bound on the total variation distance which is based on the
distribution of the relative information; the latter inequality is based on a modification of the
proof of [25, Theorem 8], and it has the advantage of being tight for a double-parameter family
of probability measures which are defined on an arbitrary set of 2 elements.

A. Main Result and Proof

Inequality (9) provides an upper bound on the relative entropy $D(P\|Q)$ as a function of the
total variation distance when $P \ll Q$, and the relative information $i_{P\|Q}$ is bounded from above
(this implies that $\beta_1$ in (8) is positive). The following theorem tightens this upper bound.
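Theorem 1: Let P and Q be probability measures on a measurable space $(\mathcal{A}, \mathcal{F})$, $P \ll Q$,
and let $\beta_1, \beta_2 \in [0,1]$ be given by

$$ \beta_1^{-1} \triangleq \sup_{a \in \mathcal{A}} \frac{dP}{dQ}(a), \qquad \beta_2 \triangleq \inf_{a \in \mathcal{A}} \frac{dP}{dQ}(a). \tag{12} $$

Then, the following inequality holds:

$$ D(P\|Q) \le \frac{1}{2} \left( \frac{\log \frac{1}{\beta_1}}{1-\beta_1} - \beta_2 \log e \right) |P-Q|. \tag{13} $$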
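As an illustration of Theorem 1 (not part of the original paper), the sketch below evaluates both the bound (9) and the refined bound (13) for distributions on a finite set, where the supremum and infimum in (12) become a maximum and a minimum. It assumes natural logarithms ($\log e = 1$) and a strictly positive Q; all function names are hypothetical.

```python
import math
import random

def kl_divergence(p, q):
    """D(P||Q) in nats; assumes Q(a) > 0 wherever P(a) > 0."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

def total_variation(p, q):
    return sum(abs(pa - qa) for pa, qa in zip(p, q))

def reverse_pinsker_bounds(p, q):
    """Return (bound_9, bound_13) in nats for a strictly positive Q."""
    ratios = [pa / qa for pa, qa in zip(p, q)]
    beta1 = 1.0 / max(ratios)   # beta_1^{-1} = sup dP/dQ
    beta2 = min(ratios)         # beta_2 = inf dP/dQ
    tv = total_variation(p, q)
    if beta1 >= 1.0:            # happens only when P = Q
        return 0.0, 0.0
    slope = math.log(1.0 / beta1) / (1.0 - beta1)
    return 0.5 * slope * tv, 0.5 * (slope - beta2) * tv

def random_distribution(size, rng):
    """Strictly positive random probability vector (illustrative helper)."""
    weights = [rng.random() + 0.05 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
p = random_distribution(5, rng)
q = random_distribution(5, rng)
bound_9, bound_13 = reverse_pinsker_bounds(p, q)
# Expected ordering: D(P||Q) <= bound (13) <= bound (9).
print(kl_divergence(p, q), bound_13, bound_9)
```

The gap between the two bounds equals $\frac{1}{2} \beta_2 |P-Q|$ in nats, so the refinement matters most when $\frac{dP}{dQ}$ is bounded away from zero.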


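Proof: Let $X \sim P$, $Y \sim Q$, and

$$ \mathcal{B} \triangleq \{a \in \mathcal{A}: i_{P\|Q}(a) > 0\}. \tag{14} $$

From (5), the relative entropy is equal to

$$ D(P\|Q) = \int_{\mathcal{A}} dQ \, \exp(i_{P\|Q}) \, i_{P\|Q} = \int_{\mathcal{B}} dQ \, \exp(i_{P\|Q}) \, i_{P\|Q} + \int_{\mathcal{A} \setminus \mathcal{B}} dQ \, \exp(i_{P\|Q}) \, i_{P\|Q}. \tag{15} $$

In the following, the two integrals on the RHS of (15) are upper bounded. The upper bound on
the first integral on the RHS of (15) is based on the proof of [25, Theorem 7]; it is provided
in the following for completeness, and with more details in order to clarify the way that this
bound is refined here. Let $z(a) \triangleq \exp(i_{P\|Q}(a))$ for $a \in \mathcal{A}$. By assumption, $1 < z(a) \le \frac{1}{\beta_1}$ for
all $a \in \mathcal{B}$. The function $f(z) \triangleq \frac{z \log z}{z-1}$ is monotonic increasing over the interval $(1, \infty)$ since
we have $(z-1)^2 f'(z) = (z-1) \log e - \log z > 0$ for $z > 1$. Consequently, we have

$$ \frac{z(a) \log z(a)}{z(a) - 1} \le \frac{\log \frac{1}{\beta_1}}{1 - \beta_1}, \quad \forall a \in \mathcal{B} \tag{16} $$

and

$$
\begin{aligned}
\int_{\mathcal{B}} dQ \, \exp(i_{P\|Q}) \, i_{P\|Q}
&\overset{(a)}{\le} \frac{\log \frac{1}{\beta_1}}{1-\beta_1} \int_{\mathcal{B}} dQ \, \big(\exp(i_{P\|Q}) - 1\big) \\
&\overset{(b)}{=} \frac{\log \frac{1}{\beta_1}}{1-\beta_1} \int_{\mathcal{A}} dQ(a) \, \big(1 - \exp(i_{P\|Q}(a))\big)^- \\
&\overset{(c)}{=} \frac{\log \frac{1}{\beta_1}}{1-\beta_1} \; \mathbb{E}\Big[\big(1 - \exp(i_{P\|Q}(Y))\big)^-\Big] \\
&\overset{(d)}{=} \frac{\log \frac{1}{\beta_1}}{2(1-\beta_1)} \; |P-Q|
\end{aligned}
\tag{17}
$$

where inequality (a) follows from (16), equality (b) is due to (14) and the definition $(a)^- \triangleq -a \, 1\{a < 0\}$,
equality (c) holds since $Y \sim Q$, and equality (d) follows from [25, Eq. (14)].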












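At this point, we deviate from the analysis in [25] where the second integral on the RHS of
(15) has been upper bounded by zero (since $i_{P\|Q}(a) \le 0$ for all $a \in \mathcal{A} \setminus \mathcal{B}$). If $\beta_2 > 0$, we
provide in the following a strictly negative upper bound on this integral. Since $P \ll Q$, we have

$$
\begin{aligned}
\int_{\mathcal{A} \setminus \mathcal{B}} dQ \, \exp(i_{P\|Q}) \, i_{P\|Q}
&\overset{(a)}{=} \int_{\{a \in \mathcal{A}: \, i_{P\|Q}(a) < 0\}} dQ(a) \, \frac{dP}{dQ}(a) \, i_{P\|Q}(a) \\
&\overset{(b)}{\le} \beta_2 \int_{\{a \in \mathcal{A}: \, i_{P\|Q}(a) < 0\}} dQ(a) \, i_{P\|Q}(a) \\
&\overset{(c)}{\le} \beta_2 \log e \int_{\{a \in \mathcal{A}: \, i_{P\|Q}(a) < 0\}} dQ(a) \, \big(\exp(i_{P\|Q}(a)) - 1\big) \\
&\overset{(d)}{=} -\beta_2 \log e \int_{\mathcal{A} \setminus \mathcal{B}} dQ(a) \, \big(1 - \exp(i_{P\|Q}(a))\big) \\
&\overset{(e)}{=} -\beta_2 \log e \int_{\mathcal{A}} dQ(a) \, \big(1 - \exp(i_{P\|Q}(a))\big)^+ \\
&\overset{(f)}{=} -\beta_2 \log e \cdot \mathbb{E}\Big[\big(1 - \exp(i_{P\|Q}(Y))\big)^+\Big] \\
&\overset{(g)}{=} -\frac{\beta_2 \log e}{2} \; |P-Q|
\end{aligned}
\tag{18}
$$

where $(x)^+ \triangleq \max\{x, 0\}$, equality (a) holds due to (4), (14) and since the integrand is zero if $i_{P\|Q} = 0$, inequality (b)
follows from the definition of $\beta_2$ in (12) and since $i_{P\|Q}$ is negative over the domain of integration,
inequality (c) holds since the inequality $x \le \log e \, (\exp(x) - 1)$ is satisfied for all $x \in \mathbb{R}$,
equalities (d) and (e) follow from the definition of the set $\mathcal{B}$ in (14), equality (f) holds since
$Y \sim Q$, and equality (g) follows from [25, Eq. (15)].

Inequality (13) finally follows by combining (15), (17) and (18). ■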


B. Example for the Refined Inequality in Theorem 1

We exemplify in the following the improvement obtained by (13), in comparison to (9), due
to the introduction of the additional parameter $\beta_2$ in (12). Note that when $\beta_2$ is replaced by
zero (i.e., no information on the infimum of $\frac{dP}{dQ}$ is available or $\beta_2 = 0$), inequalities (9) and (13)
coincide.

Let P and Q be two probability measures, defined on a set $\mathcal{A}$, where $P \ll Q$ and assume
that

$$ 1 - \eta \le \frac{dP}{dQ}(a) \le 1 + \eta, \quad \forall a \in \mathcal{A} \tag{19} $$

for a fixed constant $\eta \in (0,1)$.

In (13), one can replace $\beta_1$ and $\beta_2$ with lower bounds on these constants. From (12), we have
$\beta_1 \ge \frac{1}{1+\eta}$ and $\beta_2 \ge 1 - \eta$, and it follows from (13) that

$$
\begin{aligned}
D(P\|Q) &\le \frac{1}{2} \left( \frac{(1+\eta) \log(1+\eta)}{\eta} - (1-\eta) \log e \right) |P-Q| \\
&\le \frac{1}{2} \big( (1+\eta) \log e - (1-\eta) \log e \big) \, |P-Q| \\
&= (\eta \log e) \, |P-Q|.
\end{aligned}
\tag{20}
$$

From (19),

$$ \big| \exp(i_{P\|Q}(a)) - 1 \big| \le \eta, \quad \forall a \in \mathcal{A} $$

so, from (6), the total variation distance satisfies (recall that $Y \sim Q$)

$$ |P-Q| = \mathbb{E}\big[\,\big|\exp(i_{P\|Q}(Y)) - 1\big|\,\big] \le \eta. $$

Combining the last inequality with (20) gives that

$$ D(P\|Q) \le \eta^2 \log e, \quad \forall \eta \in (0,1). \tag{21} $$

For comparison, it follows from (9) (see [25, Theorem 7]) that

$$
\begin{aligned}
D(P\|Q) &\le \frac{\log \frac{1}{\beta_1}}{2(1-\beta_1)} \; |P-Q| \\
&\le \frac{(1+\eta) \log(1+\eta)}{2\eta} \; |P-Q| \\
&\le \frac{1}{2} \, (1+\eta) \log(1+\eta) \\
&\le \frac{\log e}{2} \; \eta (1+\eta).
\end{aligned}
\tag{22}
$$

Let $\eta \approx 0$. The upper bound on the relative entropy in (22) scales like $\eta$ whereas the tightened
bound in (21) scales like $\eta^2$. The scaling in (21) is correct, as it follows from Pinsker’s inequality.
For example, consider the probability measures defined on a two-element set $\mathcal{A} = \{a, b\}$ with

$$ P(a) = Q(b) = \frac{1}{2} - \frac{\eta}{4}, \qquad P(b) = Q(a) = \frac{1}{2} + \frac{\eta}{4}. $$

Condition (19) is satisfied for $\eta \approx 0$, and Pinsker’s inequality yields that

$$ D(P\|Q) \ge \frac{\eta^2}{2} \, \log e \tag{23} $$

so the ratio of the upper and lower bounds in (21) and (23) is 2, and both provide the true
quadratic scaling in $\eta$ whereas the weaker upper bound in (22) scales linearly in $\eta$ for $\eta \approx 0$.
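The quadratic scaling can be observed numerically for the two-element example. The sketch below is an illustration added here, not taken from the paper; it uses natural logarithms and a hypothetical helper `two_point_divergence`. For this pair $|P-Q| = \eta$, so the Pinsker bound (23) corresponds to a ratio $D(P\|Q)/\eta^2$ of at least 1/2, and the bound (21) to a ratio of at most 1.

```python
import math

def two_point_divergence(eta):
    """D(P||Q) in nats for P(a) = Q(b) = 1/2 - eta/4 and P(b) = Q(a) = 1/2 + eta/4."""
    pa = 0.5 - eta / 4.0   # P(a) = Q(b)
    pb = 0.5 + eta / 4.0   # P(b) = Q(a)
    return pa * math.log(pa / pb) + pb * math.log(pb / pa)

for eta in (0.5, 0.2, 0.1, 0.01):
    ratio = two_point_divergence(eta) / eta ** 2
    # The ratio stays between 1/2 (Pinsker) and 1 (bound (21)).
    print(eta, ratio)
```

As $\eta$ decreases, the printed ratio approaches 1/2, which is the constant in Pinsker's inequality.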


C. Another Lower Bound on the Total Variation Distance 


The following lower bound on the total variation distance is based on the distribution of the
relative information, and it improves the lower bounds in [15, Eq. (2.3.18)], [22, Lemma 7] and
[25, Theorem 8] by modifying the proof of the latter theorem in [25]. Besides improving the
tightness of the bound, the motivation for the derivation of the following lower bound is that it
is achieved under some conditions.

Theorem 2: If P and Q are mutually absolutely continuous probability measures, then

$$ |P-Q| \ge \sup_{\eta > 0} \Big\{ \big(1 - \exp(-\eta)\big) \Big( \mathbb{P}\big[i_{P\|Q}(X) \ge \eta\big] + \exp(\eta) \, \mathbb{P}\big[i_{P\|Q}(X) \le -\eta\big] \Big) \Big\} \tag{24} $$

where $X \sim P$. This lower bound is attained if P and Q are probability measures on a 2-element
set $\mathcal{A} = \{a, b\}$ where, for an arbitrary $\eta > 0$,

$$ P(a) = \frac{\exp(\eta) - 1}{2 \sinh(\eta)}, \qquad Q(a) = \frac{1 - \exp(-\eta)}{2 \sinh(\eta)}. \tag{25} $$
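The tightness claim in Theorem 2 can be checked directly. The following sketch is illustrative and not part of the paper; the helper names are hypothetical, natural logarithms are used, and a small tolerance is added when comparing $i_{P\|Q}$ with $\pm\eta$ to absorb floating-point rounding. It builds the pair in (25) and evaluates the expression inside the supremum of (24) at the same $\eta$.

```python
import math

def two_point_pair(eta):
    """The distributions P and Q on {a, b} from (25)."""
    s = 2.0 * math.sinh(eta)
    pa = (math.exp(eta) - 1.0) / s
    qa = (1.0 - math.exp(-eta)) / s
    return [pa, 1.0 - pa], [qa, 1.0 - qa]

def lower_bound_at(p, q, eta, tol=1e-9):
    """(1 - e^{-eta}) * (P[i >= eta] + e^{eta} * P[i <= -eta]) with X ~ P."""
    upper_tail = sum(pa for pa, qa in zip(p, q) if math.log(pa / qa) >= eta - tol)
    lower_tail = sum(pa for pa, qa in zip(p, q) if math.log(pa / qa) <= -eta + tol)
    return (1.0 - math.exp(-eta)) * (upper_tail + math.exp(eta) * lower_tail)

eta = 0.7
p, q = two_point_pair(eta)
tv = sum(abs(pa - qa) for pa, qa in zip(p, q))
# For the pair in (25) both numbers coincide (and equal 2 * tanh(eta / 2)).
print(tv, lower_bound_at(p, q, eta))
```

For other pairs of distributions, evaluating `lower_bound_at` over a grid of values of `eta` gives a numerical approximation of the supremum in (24).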










Proof: Since $P \ll\gg Q$, we have

$$
\begin{aligned}
|P-Q| &= \mathbb{E}\big[\,\big|1 - \exp(-i_{P\|Q}(X))\big|\,\big] \\
&\ge \mathbb{E}\big[\,\big|1 - \exp(-i_{P\|Q}(X))\big| \; 1\{|i_{P\|Q}(X)| \ge \eta\}\big], \quad \forall \eta > 0
\end{aligned}
$$

where $1\{\cdot\}$ is the indicator function of the specified event (it is equal to 1 if the event occurs,
and it is zero otherwise). At this point we deviate from the proof of [25, Theorem 8], and write

$$
\begin{aligned}
|P-Q| &\ge \mathbb{E}\big[\,\big|1 - \exp(-i_{P\|Q}(X))\big| \; 1\{i_{P\|Q}(X) \ge \eta\}\big]
+ \mathbb{E}\big[\,\big|1 - \exp(-i_{P\|Q}(X))\big| \; 1\{i_{P\|Q}(X) \le -\eta\}\big] \\
&\overset{(a)}{\ge} \big(1 - \exp(-\eta)\big) \, \mathbb{E}\big[1\{i_{P\|Q}(X) \ge \eta\}\big] + \big(\exp(\eta) - 1\big) \, \mathbb{E}\big[1\{i_{P\|Q}(X) \le -\eta\}\big] \\
&= \big(1 - \exp(-\eta)\big) \Big( \mathbb{P}\big[i_{P\|Q}(X) \ge \eta\big] + \exp(\eta) \, \mathbb{P}\big[i_{P\|Q}(X) \le -\eta\big] \Big)
\end{aligned}
\tag{26}
$$

where step (a) follows from the inequality $|1 - \exp(-z)| \ge 1 - \exp(-\eta)$ if $z \ge \eta$, and
$|1 - \exp(-z)| \ge \exp(\eta) - 1$ if $z \le -\eta$. Taking the supremum on the right-hand side of (26),
w.r.t. the free parameter $\eta > 0$, gives the lower bound on $|P-Q|$ in (24).

The condition (25) for the tightness of the lower bound in (24) follows from the fact that,
for an arbitrary $\eta > 0$, we have $\log \frac{P(a)}{Q(a)} = \eta$ and $\log \frac{P(b)}{Q(b)} = -\eta$. This yields that the
inequalities in the derivation of the lower bound (26) are satisfied with equality. ■

Remark 1: One can further tighten the lower bound in (24) by writing, for arbitrary $\eta_1, \eta_2 > 0$,

$$ |P-Q| \ge \mathbb{E}\big[\,\big|1 - \exp(-i_{P\|Q}(X))\big| \; 1\{i_{P\|Q}(X) \ge \eta_1\}\big] + \mathbb{E}\big[\,\big|1 - \exp(-i_{P\|Q}(X))\big| \; 1\{i_{P\|Q}(X) \le -\eta_2\}\big] $$

and proceeding similarly to (26) to get the following lower bound on the total variation distance:

$$ |P-Q| \ge \sup_{\eta_1, \eta_2 > 0} \Big\{ \big(1 - \exp(-\eta_1)\big) \, \mathbb{P}\big[i_{P\|Q}(X) \ge \eta_1\big] + \big(\exp(\eta_2) - 1\big) \, \mathbb{P}\big[i_{P\|Q}(X) \le -\eta_2\big] \Big\}. $$

This lower bound is achieved if P and Q are probability measures on a 2-element set $\mathcal{A} = \{a, b\}$
where, for arbitrary $\eta_1, \eta_2 > 0$,

$$ P(a) = \frac{\exp(\eta_1) - \exp(\eta_1 - \eta_2)}{\exp(\eta_1) - \exp(-\eta_2)}, \qquad Q(a) = \frac{1 - \exp(-\eta_2)}{\exp(\eta_1) - \exp(-\eta_2)} \tag{27} $$

which implies that $\log \frac{P(a)}{Q(a)} = \eta_1$ and $\log \frac{P(b)}{Q(b)} = -\eta_2$. Condition (27) is specialized to
the condition in (25) when $\eta_1 = \eta_2 = \eta > 0$.


III. A Reverse Pinsker Inequality for Probability Measures on a Finite Set 

The present section introduces a strengthened version of inequality (10) (see Theorem 3) as
a reverse Pinsker inequality for probability measures on a finite set, followed by a discussion
and an example.

A. Main Result and Proof

Theorem 3: Let P and Q be probability measures defined on a common finite set $\mathcal{A}$, and
assume that Q is strictly positive on $\mathcal{A}$. Then, the following inequality holds:

$$ D(P\|Q) \le \log\left(1 + \frac{|P-Q|^2}{2 Q_{\min}}\right) - \frac{\beta_2 \log e}{2} \; |P-Q|^2 \tag{28} $$

where

$$ Q_{\min} \triangleq \min_{a \in \mathcal{A}} Q(a) > 0, \qquad \beta_2 \triangleq \min_{a \in \mathcal{A}} \frac{P(a)}{Q(a)} \in [0,1]. \tag{29} $$


Remark 2: The upper bound on the relative entropy in Theorem 3 improves the bound in
(10). The improvement in (28) is demonstrated as follows: let $V \triangleq |P-Q|$, then the RHS of
(28) satisfies

$$ \log\left(1 + \frac{V^2}{2 Q_{\min}}\right) - \frac{\beta_2 \log e}{2} \, V^2 \le \log\left(1 + \frac{V^2}{2 Q_{\min}}\right) \le \frac{V^2 \log e}{2 Q_{\min}} < \frac{V^2 \log e}{Q_{\min}}. $$

Hence, the upper bound on $D(P\|Q)$ in Theorem 3 can be loosened to (10).
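To get a feeling for the size of the improvement, the sketch below compares the bound (28) of Theorem 3 with the bound (10) on random strictly positive distributions. It is an illustration added here under stated assumptions: natural logarithms ($\log e = 1$), finite alphabets given as lists, and hypothetical function names.

```python
import math
import random

def kl_divergence(p, q):
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

def total_variation(p, q):
    return sum(abs(pa - qa) for pa, qa in zip(p, q))

def theorem3_bound(p, q):
    """RHS of (28) in nats; Q must be strictly positive."""
    tv = total_variation(p, q)
    q_min = min(q)
    beta2 = min(pa / qa for pa, qa in zip(p, q))
    return math.log(1.0 + tv ** 2 / (2.0 * q_min)) - 0.5 * beta2 * tv ** 2

def bound_10(p, q):
    """RHS of (10) in nats."""
    return total_variation(p, q) ** 2 / min(q)

def random_distribution(size, rng):
    """Strictly positive random probability vector (illustrative helper)."""
    weights = [rng.random() + 0.05 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
p = random_distribution(8, rng)
q = random_distribution(8, rng)
# Expected ordering: D(P||Q) <= bound (28) <= bound (10).
print(kl_divergence(p, q), theorem3_bound(p, q), bound_10(p, q))
```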

Proof: Theorem 3 is proved by obtaining upper and lower bounds on the $\chi^2$-divergence
from P to Q

$$ \chi^2(P,Q) \triangleq \sum_{a \in \mathcal{A}} \frac{(P(a) - Q(a))^2}{Q(a)} = \sum_{a \in \mathcal{A}} \frac{P(a)^2}{Q(a)} - 1. \tag{30} $$

A lower bound follows by invoking Jensen’s inequality:

$$
\begin{aligned}
\chi^2(P,Q) &= \sum_{a \in \mathcal{A}} \frac{P(a)^2}{Q(a)} - 1 \\
&= \sum_{a \in \mathcal{A}} P(a) \exp\left(\log \frac{P(a)}{Q(a)}\right) - 1 \\
&\ge \exp\left(\sum_{a \in \mathcal{A}} P(a) \log \frac{P(a)}{Q(a)}\right) - 1 \\
&= \exp\big(D(P\|Q)\big) - 1.
\end{aligned}
\tag{31}
$$

A refined version of (31) is derived in the following. The starting point of its derivation relies on
a refined version of Jensen’s inequality from [5, Theorem 1], which yields the inequality

$$ \min_{a \in \mathcal{A}} \frac{P(a)}{Q(a)} \cdot D(Q\|P) \le \log\big(1 + \chi^2(P,Q)\big) - D(P\|Q) \le \max_{a \in \mathcal{A}} \frac{P(a)}{Q(a)} \cdot D(Q\|P). \tag{32} $$

Inequality (32) is proved in the appendix. From the LHS of (32) and the definition of $\beta_2$ in
(29), we have

$$
\begin{aligned}
\chi^2(P,Q) &\ge \exp\big(D(P\|Q) + \beta_2 \, D(Q\|P)\big) - 1 \\
&\ge \exp\left(D(P\|Q) + \frac{\beta_2 \log e}{2} \cdot |P-Q|^2\right) - 1
\end{aligned}
\tag{33}
$$

where the last inequality relies on Pinsker’s lower bound on $D(Q\|P)$. Inequality (33) refines
the lower bound in (31) since $\beta_2 \in [0,1]$, and it coincides with (31) in the worst case where
$\beta_2 = 0$.
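The two-sided inequality (32) can be sanity-checked numerically before reading its proof. The sketch below is an added illustration, not the paper's own material; it assumes natural logarithms, strictly positive P and Q (so that $D(Q\|P)$ is finite), and hypothetical helper names.

```python
import math
import random

def kl_divergence(p, q):
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

def chi_square(p, q):
    """chi^2(P, Q) = sum_a P(a)^2 / Q(a) - 1, as in (30)."""
    return sum(pa * pa / qa for pa, qa in zip(p, q)) - 1.0

def jensen_gap_bounds(p, q):
    """Return (lower, gap, upper) for the three terms of (32), in nats."""
    ratios = [pa / qa for pa, qa in zip(p, q)]
    d_qp = kl_divergence(q, p)
    gap = math.log(1.0 + chi_square(p, q)) - kl_divergence(p, q)
    return min(ratios) * d_qp, gap, max(ratios) * d_qp

def random_distribution(size, rng):
    """Strictly positive random probability vector (illustrative helper)."""
    weights = [rng.random() + 0.05 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
p = random_distribution(6, rng)
q = random_distribution(6, rng)
# Expected ordering: lower <= gap <= upper.
print(jensen_gap_bounds(p, q))
```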
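An upper bound on $\chi^2(P,Q)$ is derived as follows:

$$
\begin{aligned}
\chi^2(P,Q) &= \sum_{a \in \mathcal{A}} \frac{(P(a) - Q(a))^2}{Q(a)} \\
&\le \frac{\sum_{a \in \mathcal{A}} (P(a) - Q(a))^2}{\min_{a \in \mathcal{A}} Q(a)} \\
&\le \frac{\max_{a \in \mathcal{A}} |P(a) - Q(a)| \; \sum_{a \in \mathcal{A}} |P(a) - Q(a)|}{\min_{a \in \mathcal{A}} Q(a)} \\
&= \frac{|P-Q| \; \max_{a \in \mathcal{A}} |P(a) - Q(a)|}{Q_{\min}}
\end{aligned}
\tag{34}
$$

and, from (3),

$$ |P-Q| \ge 2 \max_{a \in \mathcal{A}} |P(a) - Q(a)| \tag{35} $$

since, for every $a \in \mathcal{A}$, the 1-element set $\{a\}$ is included in the $\sigma$-algebra $\mathcal{F}$. Combining (34)
and (35) gives that

$$ \chi^2(P,Q) \le \frac{|P-Q|^2}{2 Q_{\min}}. \tag{36} $$

Inequality (28) finally follows from the bounds on the $\chi^2$-divergence in (33) and (36). ■

Corollary 1: Under the same setting as in Theorem 3, we have

$$ D(P\|Q) \le \log\left(1 + \frac{|P-Q|^2}{2 Q_{\min}}\right). \tag{37} $$

Proof: This inequality follows from (28) and since $\beta_2 \ge 0$.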


B. Discussion 


In the following, we discuss Theorem 3 and its proof, and link it to some related results.

Remark 3: The combination of (31) with the second line of (34), without further loosening
the upper bound on the $\chi^2$-divergence as is done in the third line of (34) and inequality (35),
gives the following tighter upper bound on the relative entropy in terms of the Euclidean norm
$|P-Q|_2$:

$$ D(P\|Q) \le \log\left(1 + \frac{|P-Q|_2^2}{Q_{\min}}\right). \tag{38} $$

This improves the upper bound on the relative entropy in the proofs of Property 4 of [23,
Lemma 7] and [13, Appendix D]:

$$ D(P\|Q) \le \frac{|P-Q|_2^2 \, \log e}{Q_{\min}}. $$

Furthermore, avoiding the use of Jensen’s inequality in (31) gives the equality (see [6, Eq. (6)])

$$ \chi^2(P,Q) = \exp\big(D_2(P\|Q)\big) - 1 \tag{39} $$

whose combination with the second line of (34) gives

$$ D_2(P\|Q) \le \log\left(1 + \frac{|P-Q|_2^2}{Q_{\min}}\right). \tag{40} $$

Inequality (40) improves the tightness of inequality (38). Note that (40) is satisfied with equality
when Q is an equiprobable distribution over a finite set.

Remark 4: Inequality (31) improves the lower bound on the $\chi^2$-divergence in [4, Lemma 6.3]
which states that $\chi^2(P,Q) \ge D(P\|Q)$; this improvement also follows from [6, Eqs. (6), (7)].

Remark 5: The upper bound on the relative entropy in (28) involves the parameter $\beta_2 \in [0,1]$
as defined in (29). A non-trivial lower bound on $\beta_2$ can be used in conjunction with (28) for
improving the upper bound in Corollary 1. We derive in the following a lower bound on $\beta_2$ for
a given probability measure Q and a given total variation distance $|P-Q|$, which can be used
in conjunction with (28) to get an upper bound on the relative entropy $D(P\|Q)$. We have

$$ \beta_2 = \min_{a \in \mathcal{A}} \frac{P(a)}{Q(a)} \ge \frac{P_{\min}}{Q_{\max}} $$

where

$$ P_{\min} \triangleq \min_{a \in \mathcal{A}} P(a), \qquad Q_{\max} \triangleq \max_{a \in \mathcal{A}} Q(a). $$

Note that, if $|P-Q| < Q_{\min}$ then $P_{\min} \ge Q_{\min} - |P-Q| > 0$. Let $(x)^+ \triangleq \max\{x, 0\}$, then

$$ \beta_2 \ge \frac{\big(Q_{\min} - |P-Q|\big)^+}{Q_{\max}}. \tag{41} $$


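Remark 6: In an attempt to extend the concept of proof of Theorem 3 to general probability
measures, we have

$$
\begin{aligned}
\chi^2(P,Q) &= \int_{\mathcal{A}} \left(\frac{dP}{dQ} - 1\right)^2 dQ \\
&= \mathbb{E}\Big[\big(\exp(i_{P\|Q}(Y)) - 1\big)^2\Big] \quad (Y \sim Q) \\
&\le \sup_{a \in \mathcal{A}} \big|\exp(i_{P\|Q}(a)) - 1\big| \cdot \mathbb{E}\big[\,\big|\exp(i_{P\|Q}(Y)) - 1\big|\,\big] \\
&\overset{(a)}{=} \sup_{a \in \mathcal{A}} \big|\exp(i_{P\|Q}(a)) - 1\big| \cdot |P-Q| \\
&= \sup_{a \in \mathcal{A}} \left|\frac{dP}{dQ}(a) - 1\right| \cdot |P-Q|
\end{aligned}
\tag{42}
$$

where equality (a) holds due to (6). Let $\beta_1, \beta_2 \in [0,1]$ be defined as in Theorem 1 (see (12)).
Since we have $\beta_2 \le \frac{dP}{dQ}(a) \le \beta_1^{-1}$ for all $a \in \mathcal{A}$, then

$$ \sup_{a \in \mathcal{A}} \left|\frac{dP}{dQ}(a) - 1\right| \le \max\big\{\beta_1^{-1} - 1, \; 1 - \beta_2\big\}. \tag{43} $$

A combination of (42) and (43) leads to the following upper bound on the $\chi^2$-divergence:

$$ \chi^2(P,Q) \le \max\big\{\beta_1^{-1} - 1, \; 1 - \beta_2\big\} \cdot |P-Q|. \tag{44} $$

A combination of (39) (see [6, Eq. (6)]) and (44) gives

$$ D_2(P\|Q) \le \log\Big(1 + \max\big\{\beta_1^{-1} - 1, \; 1 - \beta_2\big\} \cdot |P-Q|\Big) \tag{45} $$

and since the Rényi divergence is monotonic non-decreasing in its order (see, e.g., [6, Theorem 3])
and $D(P\|Q) = D_1(P\|Q)$, it follows that

$$ D(P\|Q) \le \log\Big(1 + \max\big\{\beta_1^{-1} - 1, \; 1 - \beta_2\big\} \cdot |P-Q|\Big). \tag{46} $$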














A comparison of the upper bound on the relative entropy in (46) and the bound of Theorem 1
in (13) yields that the latter bound is superior. Hence, the extension of the concept of proof of
Theorem 3 to general probability measures does not improve the bound in Theorem 1.

Remark 7: The second inequality in (33) relies on Pinsker’s inequality as a lower bound on
$D(Q\|P)$. This lower bound can be slightly improved by invoking higher-order Pinsker’s-type
inequalities (see lH Section 5] and references therein). In 0 Section 6], Gilardoni derived a lower 
bound on the relative entropy which is tight for both large and small total variation distances. 
Hence, the second inequality in (33) can instead rely on the inequality (see Eq. (2) in Gilardoni's paper):

$$ D(Q\|P) \ge -\log\left(1 - \frac{|P-Q|}{2}\right) - \left(1 - \frac{|P-Q|}{2}\right) \log\left(1 + \frac{|P-Q|}{2}\right). $$

Note that although the latter lower bound on the relative entropy is tight for both large and small
total variation distances, it is not uniformly tighter than Pinsker’s inequality. For this reason and
for the simplicity of the bound, we rely on Pinsker’s inequality in the second inequality of
(33). An exact parametrization of the minimum of the relative entropy in terms of the total
variation distance was introduced in 0 Theorem 1], expressed in terms of hyperbolic functions; 
the bound, however, is not expressed in closed form in terms of the total variation distance. 

Remark 8: A related problem to Theorem 3 has been recently studied in [1]. Consider an
arbitrary probability measure Q, and an arbitrary $\varepsilon \in [0,2]$. The problem studied in [1] is the
characterization of $D^*(\varepsilon, Q)$, defined to be the infimum of $D(P\|Q)$ over all probability measures
P that are at least $\varepsilon$-far away from Q in total variation, i.e.,

$$ D^*(\varepsilon, Q) = \inf_{P: \, |P-Q| \ge \varepsilon} D(P\|Q), \quad \varepsilon \in [0,2]. $$

Note that $D(P\|Q) < \infty$ yields that $\text{Supp}(P) \subseteq \text{Supp}(Q)$. From Sanov’s theorem (see [3,
Theorem 11.4.1]), $D^*(\varepsilon, Q)$ is equal to the asymptotic exponential decay of the probability that
the total variation distance between the empirical distribution of a sequence of i.i.d. random
variables and the true distribution (Q) is more than a specified value $\varepsilon$. Upper and lower bounds
on $D^*(\varepsilon, Q)$ have been introduced in [1, Theorem 1], in terms of the balance coefficient $\beta$
that is defined as

$$ \beta \triangleq \inf\Big\{ x \in \{Q(A): A \in \mathcal{F}\}: \; x \ge \tfrac{1}{2} \Big\}. $$
It has been demonstrated in 0 Theorem 1] that 

D*{e,Q) = C£^ + 0{e^) (47) 



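where

$$ \frac{1}{4(2\beta - 1)} \, \log\left(\frac{\beta}{1-\beta}\right) \le C \le \frac{\log e}{8 \beta (1-\beta)}. $$

If the support of Q is a finite set $\mathcal{A}$, Theorem 3 and (41) yield that

$$ D^*(\varepsilon, Q) \le \log\left(1 + \frac{\varepsilon^2}{2 Q_{\min}}\right) - \frac{\log e}{2 Q_{\max}} \, \big(Q_{\min} - \varepsilon\big)^+ \, \varepsilon^2. $$

Hence, it follows that $D^*(\varepsilon, Q) \le C_1 \varepsilon^2 + O(\varepsilon^3)$ where

$$ C_1 = \frac{\log e}{2} \left( \frac{1}{Q_{\min}} - \frac{Q_{\min}}{Q_{\max}} \right). $$

Similarly to (47), the same quadratic scaling of $D^*(\varepsilon, Q)$ holds for small values of $\varepsilon$, but with
different coefficients.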
C. Example: Total Variation Distance From the Equiprobable Distribution

Let $\mathcal{A}$ be a finite set, and let U be the equiprobable probability measure on $\mathcal{A}$ (i.e., $U(a) = \frac{1}{|\mathcal{A}|}$
for every $a \in \mathcal{A}$). The relative entropy of an arbitrary distribution P on $\mathcal{A}$ with respect to the
equiprobable distribution satisfies

$$ D(P\|U) = \log |\mathcal{A}| - H(P). \tag{48} $$

From Pinsker’s inequality (1), the following upper bound on the total variation distance holds:

$$ |P-U| \le \sqrt{\frac{2}{\log e} \, \big(\log |\mathcal{A}| - H(P)\big)}. \tag{49} $$

From |[2^ Theorem 2.51], for all probability measures P and Q, 

$$ |P-Q| \le 2 \sqrt{1 - \exp\big(-D(P\|Q)\big)} $$

which gives the second upper bound

$$ |P-U| \le 2 \sqrt{1 - \frac{1}{|\mathcal{A}|} \, \exp\big(H(P)\big)}. \tag{50} $$

From Theorem 3 and (41), we have

$$ D(P\|U) \le \log\left(1 + \frac{|\mathcal{A}|}{2} \cdot |P-U|^2\right) - \frac{\log e}{2} \, \big(1 - |\mathcal{A}| \, |P-U|\big)^+ \, |P-U|^2. $$

A loosening of the latter bound by removing the second (non-positive) term on the RHS of this
inequality, in conjunction with (48), leads to the following closed-form expression for the lower
bound on the total variation distance:

$$ |P-U| \ge \sqrt{2 \left( \exp\big(-H(P)\big) - \frac{1}{|\mathcal{A}|} \right)}. \tag{51} $$


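Let $H(P) = \beta \log |\mathcal{A}|$, so $\beta \in [0,1]$. From (49), (50) and (51), it follows that

$$ \sqrt{2 \left( |\mathcal{A}|^{-\beta} - |\mathcal{A}|^{-1} \right)} \le |P-U| \le \min\left\{ \sqrt{2 (1-\beta) \ln |\mathcal{A}|}, \; 2 \sqrt{1 - |\mathcal{A}|^{\beta - 1}} \right\}. \tag{52} $$

As expected, if $\beta = 1$, both upper and lower bounds are equal to zero (since $D(P\|U) = 0$).
The lower bound on the LHS of (52) improves the lower bound on the total variation distance
which follows from (10):

$$ |P-U| \ge \sqrt{\frac{(1-\beta) \ln |\mathcal{A}|}{|\mathcal{A}|}}. \tag{53} $$

For example, for a set of size $|\mathcal{A}| = 1024$ and $\beta = 0.5$, the improvement in the new lower
bound on the total variation distance is from 0.0582 to 0.2461.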
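The two numbers quoted above can be reproduced from the closed-form expressions in (52) and (53). The sketch below is an added illustration with hypothetical function names; `size` stands for $|\mathcal{A}|$ and `beta` for the normalized entropy $H(P)/\log|\mathcal{A}|$.

```python
import math

def new_lower_bound(size, beta):
    """LHS of (52): sqrt(2 * (|A|^(-beta) - 1/|A|))."""
    return math.sqrt(2.0 * (size ** (-beta) - 1.0 / size))

def old_lower_bound(size, beta):
    """RHS of (53), the bound that follows from (10)."""
    return math.sqrt((1.0 - beta) * math.log(size) / size)

def upper_bound(size, beta):
    """RHS of (52): the smaller of the two upper bounds (49) and (50)."""
    pinsker = math.sqrt(2.0 * (1.0 - beta) * math.log(size))
    second = 2.0 * math.sqrt(1.0 - size ** (beta - 1.0))
    return min(pinsker, second)

size, beta = 1024, 0.5
print(round(old_lower_bound(size, beta), 4))   # 0.0582
print(round(new_lower_bound(size, beta), 4))   # 0.2461
print(upper_bound(size, beta))
```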

Note that if $\beta \to 0$ (i.e., P is far in relative entropy from the equiprobable distribution), and
the set $\mathcal{A}$ stays fixed, the ratio between the upper and lower bounds in (52) tends to $\sqrt{2}$. On the
other hand, in this case, the ratio between the upper and the looser lower bound in (53) tends to

$$ 2 \sqrt{\frac{|\mathcal{A}| - 1}{\ln |\mathcal{A}|}}, $$

which can be made arbitrarily large for a sufficiently large set $\mathcal{A}$.
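IV. Extension of Theorem 3 to Rényi Divergences

The present section extends Theorem 3 to Rényi divergences of an arbitrary order $\alpha \in [0, \infty]$
(i.e., it relies on Theorem 3 to provide a generalization of the special case where $\alpha = 1$), and
the use of this generalized inequality is exemplified.

A. Main Result and Proof

The following theorem provides a kind of a generalized reverse Pinsker inequality where the
Rényi divergence of an arbitrary order $\alpha \in [0, \infty]$ is upper bounded in terms of the total variation
distance for probability measures defined on a common finite set.

Theorem 4: Let P and Q be probability measures on a common finite set $\mathcal{A}$, and assume that
P, Q are strictly positive. Let $\varepsilon \triangleq |P-Q|$ (recall that $\varepsilon \in [0,2]$), $\varepsilon' \triangleq \min\{1, \varepsilon\}$, and

$$ P_{\min} \triangleq \min_{a \in \mathcal{A}} P(a), \qquad Q_{\min} \triangleq \min_{a \in \mathcal{A}} Q(a). $$

Then, the Rényi divergence of order $\alpha \in [0, \infty]$ satisfies

$$
D_\alpha(P\|Q) \le
\begin{cases}
\log\left(1 + \dfrac{\varepsilon}{2 Q_{\min}}\right), & \text{if } \alpha \in (2, \infty] \\[2mm]
\log\left(1 + \dfrac{\varepsilon \varepsilon'}{2 Q_{\min}}\right), & \text{if } \alpha \in [1, 2] \\[2mm]
\min\{f_1, f_2\}, & \text{if } \alpha \in \left(\tfrac{1}{2}, 1\right) \\[2mm]
\min\left\{-2 \log\left(1 - \tfrac{\varepsilon}{2}\right), \, f_1, \, f_2\right\}, & \text{if } \alpha \in \left[0, \tfrac{1}{2}\right]
\end{cases}
\tag{54}
$$

where, for $\alpha \in [0,1)$,

$$ f_1 \equiv f_1(\alpha, P_{\min}, Q_{\min}, \varepsilon) \triangleq \frac{\alpha}{1-\alpha} \left[ \log\left(1 + \frac{\varepsilon^2}{2 P_{\min}}\right) - \frac{Q_{\min} \log e}{2} \; \varepsilon^2 \right], \tag{55} $$

$$ f_2 \equiv f_2(P_{\min}, Q_{\min}, \varepsilon, \varepsilon') \triangleq \log\left(1 + \frac{\varepsilon \varepsilon'}{2 Q_{\min}}\right) - \frac{P_{\min} \log e}{2} \; \varepsilon^2. \tag{56} $$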


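Proof: The Rényi divergence of order $\infty$ satisfies (see, e.g., [6, Theorem 6])

$$ D_\infty(P\|Q) = \log\left(\text{ess sup} \, \frac{P}{Q}\right). $$

Since, by assumption, the probability measures P and Q are defined on a common finite set $\mathcal{A}$,

$$
\begin{aligned}
D_\infty(P\|Q) &= \log\left(\max_{a \in \mathcal{A}} \frac{P(a)}{Q(a)}\right) \\
&= \log\left(1 + \max_{a \in \mathcal{A}} \frac{P(a) - Q(a)}{Q(a)}\right) \\
&\le \log\left(1 + \frac{\max_{a \in \mathcal{A}} |P(a) - Q(a)|}{\min_{a \in \mathcal{A}} Q(a)}\right) \\
&\le \log\left(1 + \frac{|P-Q|}{2 Q_{\min}}\right)
\end{aligned}
\tag{57}
$$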
where the last inequality follows from (35). Since the Rényi divergence of order $\alpha \in [0, \infty]$ is
monotonic non-decreasing in $\alpha$ (see, e.g., [6, Theorem 3]), it follows from (57) that

$$ D_\alpha(P\|Q) \le D_\infty(P\|Q) \le \log\left(1 + \frac{\varepsilon}{2 Q_{\min}}\right), \quad \forall \alpha \in [0, \infty] \tag{58} $$

which proves the first line in (54) when the validity of the bound is restricted to $\alpha \in (2, \infty]$.

For proving the second line in (54), it is shown that the bound in (37) can be sharpened by
replacing $D(P\|Q)$ on the LHS of (37) with the quadratic Rényi divergence $D_2(P\|Q)$ (note
that $D_2(P\|Q) \ge D(P\|Q)$), leading to

$$ D_2(P\|Q) \le \log\left(1 + \frac{\varepsilon^2}{2 Q_{\min}}\right). \tag{59} $$

The strengthened inequality in (59), in comparison to (37), follows by replacing inequality (31)
with the equality in (39). Combining (36) and (39) gives inequality (59), and

$$ D_\alpha(P\|Q) \le D_2(P\|Q) \le \log\left(1 + \frac{\varepsilon^2}{2 Q_{\min}}\right), \quad \forall \alpha \in [0, 2]. \tag{60} $$

The combination of (58) with (60) gives the second line in (54) (note that $\varepsilon \varepsilon' = \min\{\varepsilon, \varepsilon^2\}$)
while the validity of the bound is restricted to $\alpha \in [1, 2]$.

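For $\alpha \in (0,1)$, $D_\alpha(P\|Q)$ satisfies the skew-symmetry property $D_\alpha(P\|Q) = \frac{\alpha}{1-\alpha} \cdot D_{1-\alpha}(Q\|P)$
(see, e.g., [6, Proposition 2]). Consequently, we have

$$
\begin{aligned}
D_\alpha(P\|Q) &= \left(\frac{\alpha}{1-\alpha}\right) D_{1-\alpha}(Q\|P) \\
&\le \left(\frac{\alpha}{1-\alpha}\right) D(Q\|P) \\
&\le \frac{\alpha}{1-\alpha} \left[ \log\left(1 + \frac{\varepsilon^2}{2 P_{\min}}\right) - \frac{Q_{\min} \log e}{2} \; \varepsilon^2 \right], \quad \forall \alpha \in (0,1)
\end{aligned}
\tag{61}
$$

where the first inequality holds since the Rényi divergence is monotonic non-decreasing in its
order, and the second inequality follows from Theorem 3 which implies that

$$
\begin{aligned}
D(Q\|P) &\le \log\left(1 + \frac{\varepsilon^2}{2 P_{\min}}\right) - \frac{\log e}{2} \; \min_{a \in \mathcal{A}} \frac{Q(a)}{P(a)} \; \varepsilon^2 \\
&\le \log\left(1 + \frac{\varepsilon^2}{2 P_{\min}}\right) - \frac{Q_{\min} \log e}{2} \; \varepsilon^2.
\end{aligned}
$$

The third line in (54) follows from (58), (60) and (61) while restricting the validity of the bound
to $\alpha \in \left(\frac{1}{2}, 1\right)$.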

For proving the fourth line in (54), note that from (11), $D_{1/2}(P\|Q) = -2 \log Z(P,Q)$ where
$Z(P,Q) \triangleq \sum_{a \in \mathcal{A}} \sqrt{P(a) Q(a)}$ is the Bhattacharyya coefficient between P and Q [12]. The
Bhattacharyya distance is defined as minus the logarithm of the Bhattacharyya coefficient, which
is non-negative in general and it is zero if and only if P = Q (since $0 \le Z(P,Q) \le 1$, and
$Z(P,Q) = 1$ if and only if P = Q). Hence, the Rényi divergence of order $\frac{1}{2}$ is twice the
Bhattacharyya distance. Based on the inequality $Z(P,Q) \ge 1 - \frac{\varepsilon}{2}$

which follows from ifTOl 

Example 6.2] (see also ll^ Proposition 1]), we have 


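$$ D_\alpha(P\|Q) \le D_{1/2}(P\|Q) \le -2 \log\left(1 - \frac{\varepsilon}{2}\right), \quad \forall \alpha \in \left[0, \tfrac{1}{2}\right] \tag{62} $$

where $\varepsilon \triangleq |P-Q| \in [0,2]$. Finally, the last case in (54) follows from (58), (60), (61) and (62).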
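The building blocks (58), (60) and (62) of the proof can be checked numerically, together with the monotonicity of the Rényi divergence in its order. The sketch below is an added illustration under stated assumptions: natural logarithms, strictly positive P and Q on a finite set, and hypothetical function names. It does not implement the full case analysis of (54).

```python
import math
import random

def renyi_divergence(p, q, alpha):
    """D_alpha(P||Q) in nats for strictly positive P and Q, alpha in [0, inf]."""
    if alpha == math.inf:
        return math.log(max(pa / qa for pa, qa in zip(p, q)))
    if alpha == 1.0:
        return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q))
    s = sum(pa ** alpha * qa ** (1.0 - alpha) for pa, qa in zip(p, q))
    return math.log(s) / (alpha - 1.0)

def proof_bounds(p, q):
    """Return the right-hand sides of (58), (60) and (62)."""
    eps = sum(abs(pa - qa) for pa, qa in zip(p, q))
    q_min = min(q)
    return (math.log(1.0 + eps / (2.0 * q_min)),        # (58), any order
            math.log(1.0 + eps ** 2 / (2.0 * q_min)),   # (60), order <= 2
            -2.0 * math.log(1.0 - eps / 2.0))           # (62), order <= 1/2

def random_distribution(size, rng):
    """Strictly positive random probability vector (illustrative helper)."""
    weights = [rng.random() + 0.05 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
p = random_distribution(6, rng)
q = random_distribution(6, rng)
b58, b60, b62 = proof_bounds(p, q)
print(renyi_divergence(p, q, math.inf), b58)
print(renyi_divergence(p, q, 2.0), b60)
print(renyi_divergence(p, q, 0.5), b62)
```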
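B. Example: Rényi Divergence for Multinomial Distributions

Let $X_1, X_2, \ldots$ be independent Bernoulli random variables with $X_i \sim \text{Bernoulli}(p_i)$, and let
$Y_1, Y_2, \ldots$ be independent Bernoulli random variables with $Y_i \sim \text{Bernoulli}(q_i)$ (assume w.l.o.g.
that $q_i \le \frac{1}{2}$). Let $U_n$ and $V_n$ be the partial sums $U_n = \sum_{i=1}^n X_i$ and $V_n = \sum_{i=1}^n Y_i$, and let
$P_{U_n}, P_{V_n}$ denote their multinomial distributions. For all $\alpha \in [0,2]$ and $n \in \mathbb{N}$, we have

$$
\begin{aligned}
D_\alpha(P_{U_n} \| P_{V_n})
&\overset{(a)}{\le} D_\alpha(P_{X_1, \ldots, X_n} \| P_{Y_1, \ldots, Y_n}) \\
&\overset{(b)}{=} \sum_{i=1}^n D_\alpha(P_{X_i} \| P_{Y_i}) \\
&\overset{(c)}{\le} \sum_{i=1}^n \log\left(1 + \frac{|P_{X_i} - P_{Y_i}|^2}{2 \, (P_{Y_i})_{\min}}\right) \\
&\overset{(d)}{=} \sum_{i=1}^n \log\left(1 + \frac{2 (p_i - q_i)^2}{q_i}\right)
\end{aligned}
\tag{63}
$$

where inequality (a) follows from the data processing inequality for the Rényi divergence (see
[6, Theorem 9]), equality (b) follows from the additivity property of the Rényi divergence under
the independence assumption for $\{X_i\}$ and for $\{Y_i\}$ (see [6, Theorem 28]), inequality (c) follows
from Theorem 4, and equality (d) holds since $|P_{X_i} - P_{Y_i}| = 2 |p_i - q_i|$ for Bernoulli random
variables, and $(P_{Y_i})_{\min} = \min\{q_i, 1 - q_i\} = q_i$ (since $q_i \le \frac{1}{2}$). Similarly, for all $\alpha > 2$ and $n \in \mathbb{N}$,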


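\[
D_\alpha(P_{U_n} \| P_{V_n}) \le \sum_{i=1}^n \log\left(1 + 2 \left|\frac{p_i}{q_i} - 1\right|\right). \tag{64}
\]

The only difference in the derivation of (64) is in inequality (c) of (63), where the bound in the
first line of (54) is used this time.

Let $\{\varepsilon_n\}_{n=1}^{\infty}$ be a non-negative sequence such that

\[
(1 - \varepsilon_n) \, q_n \le p_n \le (1 + \varepsilon_n) \, q_n, \quad \forall \, n \in \mathbb{N}
\]

and

\[
\sum_{n=1}^{\infty} \varepsilon_n^2 < \infty.
\]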

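Then, from (63), and since $2 (p_n - q_n)^2 / q_n = 2 q_n \bigl(\tfrac{p_n}{q_n} - 1\bigr)^2 \le \varepsilon_n^2$
when $q_n \le \tfrac12$, it follows that $D_\alpha(P_{U_n} \| P_{V_n}) \le K_1$ for all $\alpha \in [0,2]$ and
$n \in \mathbb{N}$ where

\[
K_1 \triangleq \sum_{n=1}^{\infty} \log(1 + \varepsilon_n^2) < \infty.
\]

Furthermore, if $\sum_{n=1}^{\infty} \varepsilon_n < \infty$, it follows from (64) that
$D_\alpha(P_{U_n} \| P_{V_n}) \le K_2$ for all $\alpha > 2$ and $n \in \mathbb{N}$ where

\[
K_2 \triangleq \sum_{n=1}^{\infty} \log(1 + 2 \varepsilon_n) < \infty.
\]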


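Note that although $D_\alpha(P_{X_i} \| P_{Y_i})$ in equality (b) of (63) is equal to the binary Rényi divergence

\[
d_\alpha(p_i \| q_i) =
\begin{cases}
\dfrac{1}{\alpha - 1} \log\Bigl(p_i^{\alpha} q_i^{1-\alpha} + (1 - p_i)^{\alpha} (1 - q_i)^{1-\alpha}\Bigr), & \text{if } \alpha \in (0,1) \cup (1, \infty), \\[2mm]
p_i \log\Bigl(\dfrac{p_i}{q_i}\Bigr) + (1 - p_i) \log\Bigl(\dfrac{1 - p_i}{1 - q_i}\Bigr), & \text{if } \alpha = 1,
\end{cases}
\]

the reason for the use of the upper bounds in step (c) of (63) and (64) is to state sufficient
conditions, in terms of $\{\varepsilon_n\}_{n=1}^{\infty}$, for the boundedness of the Rényi divergence $D_\alpha(P_{U_n} \| P_{V_n})$.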
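The boundedness argument of this example can be illustrated numerically. The sketch below is a hedged illustration, not the paper's own computation: it assumes natural logarithms, a constant $q_i = 0.3$, and the summable sequence $\varepsilon_k = 1/(k+1)^2$ with $p_k = (1 + \varepsilon_k) q_k$; the function names are hypothetical. It evaluates the sum of binary Rényi divergences (the quantity after step (b) of (63)) and compares it with the closed-form bounds of (63) and (64) and with the partial sums of $K_1$ and $K_2$.

```python
import math


def binary_renyi(p, q, alpha):
    """Binary Renyi divergence d_alpha(p||q) in nats for 0 < p, q < 1, alpha > 0."""
    if alpha == 1.0:
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
    s = p ** alpha * q ** (1 - alpha) + (1 - p) ** alpha * (1 - q) ** (1 - alpha)
    return math.log(s) / (alpha - 1)


def bound_63(ps, qs):
    """Right side of (63): sum_i log(1 + 2 (p_i - q_i)^2 / q_i), for orders in [0, 2]."""
    return sum(math.log(1 + 2 * (p - q) ** 2 / q) for p, q in zip(ps, qs))


def bound_64(ps, qs):
    """Right side of (64): sum_i log(1 + 2 |p_i / q_i - 1|), for orders above 2."""
    return sum(math.log(1 + 2 * abs(p / q - 1)) for p, q in zip(ps, qs))


def shrinking_perturbation(n, q=0.3):
    """Assumed example: q_k = q <= 1/2 and p_k = (1 + eps_k) q with eps_k = 1/(k+1)^2."""
    eps = [1.0 / (k + 1) ** 2 for k in range(1, n + 1)]
    qs = [q] * n
    ps = [q * (1 + e) for e in eps]
    return ps, qs, eps


if __name__ == "__main__":
    ps, qs, eps = shrinking_perturbation(50)
    for alpha in (0.5, 1.0, 2.0, 3.0, 10.0):
        total = sum(binary_renyi(p, q, alpha) for p, q in zip(ps, qs))
        bound = bound_63(ps, qs) if alpha <= 2 else bound_64(ps, qs)
        print(f"alpha={alpha}: sum of d_alpha = {total:.6f}, bound = {bound:.6f}")
```

The closed-form bounds are looser than the exact sum of binary divergences, but they depend on the sequences only through $\{\varepsilon_n\}$, which is what makes the summability conditions easy to state.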
V. The Exponential Decay of the Probability for a Non-Typical Sequence

Let $U^N = (U_1, \ldots, U_N)$ be a sequence of i.i.d. symbols that are emitted by a memoryless
and stationary source with distribution $Q$ and a finite alphabet $\mathcal{A}$. Let $|\mathcal{A}| = r < \infty$ denote
the cardinality of the source alphabet, and assume that all symbols are emitted with positive
probability (i.e., $Q_{\min} \triangleq \min_{a \in \mathcal{A}} Q(a) > 0$). The empirical probability distribution of the emitted
sequence, $\hat{P}_{U^N}$, is given by

\[
\hat{P}_{U^N}(a) = \frac{1}{N} \sum_{k=1}^{N} 1\{U_k = a\}, \quad \forall \, a \in \mathcal{A}.
\]

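For an arbitrary $\delta > 0$, let the $\delta$-typical set be defined as

\[
\mathcal{T}_Q(\delta) \triangleq \Bigl\{ u^N \in \mathcal{A}^N : \, |\hat{P}_{u^N}(a) - Q(a)| < \delta \, Q(a), \; \forall \, a \in \mathcal{A} \Bigr\}, \tag{65}
\]

i.e., the empirical distribution of every symbol in an $N$-length $\delta$-typical sequence deviates from
the true distribution of this symbol by a fraction of less than $\delta$. Consequently, the complement
of (65) is given by

\[
\mathcal{T}_Q(\delta)^c = \Bigl\{ u^N \in \mathcal{A}^N : \, \exists \, a \in \mathcal{A}, \; |\hat{P}_{u^N}(a) - Q(a)| \ge \delta \, Q(a) \Bigr\}.
\]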
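The definition of $\delta$-typicality above translates directly into a membership test. The following Python sketch is illustrative only; the function names and the representation of $Q$ as a dictionary from symbols to probabilities are assumptions made for this example.

```python
from collections import Counter


def empirical_distribution(seq, alphabet):
    """Empirical distribution of seq: the fraction of positions equal to each symbol."""
    n = len(seq)
    counts = Counter(seq)
    return {a: counts.get(a, 0) / n for a in alphabet}


def is_delta_typical(seq, Q, delta):
    """True iff |P_hat(a) - Q(a)| < delta * Q(a) for every symbol a, as in (65).

    Q maps each symbol of the alphabet to its (strictly positive) probability.
    """
    p_hat = empirical_distribution(seq, Q.keys())
    return all(abs(p_hat[a] - Q[a]) < delta * Q[a] for a in Q)


if __name__ == "__main__":
    Q = {"a": 0.5, "b": 0.25, "c": 0.25}
    print(is_delta_typical("aabc" * 5, Q, delta=0.1))  # empirical distribution equals Q
    print(is_delta_typical("aaab" * 5, Q, delta=0.1))  # symbol "c" never appears
```

A sequence belongs to the complement of the typical set exactly when `is_delta_typical` returns `False`, i.e., when at least one symbol violates the relative deviation constraint.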

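From Sanov's theorem (see [3, Theorem 11.4.1]), the asymptotic exponential decay of the
probability that a sequence is not $\delta$-typical, for a specified $\delta > 0$, is given by

\[
\lim_{N \to \infty} -\frac{1}{N} \log Q^N\bigl(\mathcal{T}_Q(\delta)^c\bigr) = \min_{P \in \mathcal{P}_Q} D(P \| Q) \tag{66}
\]

where

\[
\mathcal{P}_Q \triangleq \Bigl\{ P \text{ is a probability measure on } (\mathcal{A}, \mathcal{F}) : \, \exists \, a \in \mathcal{A}, \; |P(a) - Q(a)| \ge \delta \, Q(a) \Bigr\}. \tag{67}
\]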

In the following, we obtain explicit upper and lower bounds on the exponential decay rate on
the RHS of (66). The emphasis is on the upper bound, which is based on Theorem 3, and we
first introduce the lower bound for completeness. The derivation of the lower bound is similar to
the analysis in [14, Section 4]; note, however, that there is a difference between the $\delta$-typicality
in [14, Eq. (19)] and the way it is defined in (65). The probability-dependent refinement of
Pinsker's inequality (see [14, Theorem 2.1]) states that

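\[
D(P \| Q) \ge \varphi(\pi_Q) \, |P - Q|^2 \tag{68}
\]

where

\[
\pi_Q \triangleq \max_{A \in \mathcal{F}} \min\{Q(A), 1 - Q(A)\} \le \frac{1}{2} \tag{69}
\]

and

\[
\varphi(p) \triangleq
\begin{cases}
\dfrac{1}{4 (1 - 2p)} \log\Bigl(\dfrac{1 - p}{p}\Bigr), & \text{if } p \in \bigl[0, \tfrac12\bigr), \\[2mm]
\dfrac{\log e}{2}, & \text{if } p = \tfrac12
\end{cases}
\tag{70}
\]

is a monotonically decreasing and continuous function. Hence, $\varphi(\pi_Q) \ge \tfrac12 \log e$, and (68) forms a
probability-dependent refinement of Pinsker's inequality [14]. From (67) and (68), we have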


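\[
\begin{aligned}
\min_{P \in \mathcal{P}_Q} D(P \| Q)
&\ge \varphi(\pi_Q) \min_{P \in \mathcal{P}_Q} |P - Q|^2 \\
&= \varphi(\pi_Q) \Bigl( \min_{a \in \mathcal{A}} \delta \, Q(a) \Bigr)^2 \\
&= \varphi(\pi_Q) \, Q_{\min}^2 \, \delta^2 \triangleq E_L
\end{aligned}
\tag{71}
\]

\[
\ge \frac{\log e}{2} \, Q_{\min}^2 \, \delta^2 \tag{72}
\]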


where the transition from (71) to (72) follows from the global lower bound $\varphi(\pi_Q) \ge \tfrac12 \log e$.
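The function $\varphi$ in (70) and the quantity $\pi_Q$ in (69) are easy to evaluate for a small alphabet. The sketch below is an illustration under stated assumptions: natural logarithms (so that $\tfrac12 \log e = \tfrac12$), $\mathcal{F}$ taken as the collection of all subsets of the finite alphabet, and a brute-force enumeration of subsets that is only practical for small $r$. The function names are hypothetical.

```python
import math
from itertools import combinations


def phi(p):
    """phi(p) of (70) in nats: log((1-p)/p) / (4 (1 - 2p)) on (0, 1/2), and 1/2 at p = 1/2."""
    if not 0.0 < p <= 0.5:
        raise ValueError("p must lie in (0, 1/2]")
    if p == 0.5:
        return 0.5  # (log e) / 2 with natural logarithms
    return math.log((1.0 - p) / p) / (4.0 * (1.0 - 2.0 * p))


def pi_q(Q):
    """pi_Q of (69): max over subsets A of min{Q(A), 1 - Q(A)} (brute force)."""
    best = 0.0
    for r in range(len(Q) + 1):
        for subset in combinations(Q, r):
            s = sum(subset)
            best = max(best, min(s, 1.0 - s))
    return best


def lower_bound_exponent(Q, delta):
    """E_L = phi(pi_Q) * Q_min^2 * delta^2, the expression in (71)."""
    return phi(pi_q(Q)) * (min(Q) * delta) ** 2


if __name__ == "__main__":
    Q = [0.7, 0.2, 0.1]
    print(pi_q(Q), phi(pi_q(Q)), lower_bound_exponent(Q, delta=0.1))
```

Evaluating `phi` on a grid in $(0, \tfrac12]$ shows the monotone decrease towards $\tfrac12$ (in nats), which is the global lower bound used in the transition from (71) to (72).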

In the following, we derive an upper bound on the asymptotic exponential decay rate in (66):


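\[
\begin{aligned}
\min_{P \in \mathcal{P}_Q} D(P \| Q)
&\overset{(a)}{\le} \min_{P \in \mathcal{P}_Q} \log\left(1 + \frac{|P - Q|^2}{2 Q_{\min}}\right) \\
&= \log\left(1 + \frac{\bigl(\min_{P \in \mathcal{P}_Q} |P - Q|\bigr)^2}{2 Q_{\min}}\right) \\
&\overset{(b)}{=} \log\left(1 + \frac{\bigl(\min_{a \in \mathcal{A}} \delta \, Q(a)\bigr)^2}{2 Q_{\min}}\right) \\
&= \log\left(1 + \frac{Q_{\min} \, \delta^2}{2}\right) \triangleq E_U
\end{aligned}
\tag{73}
\]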


where inequality (a) follows from Theorem 3, and equality (b) follows from (67).

The ratio between the upper and lower bounds on the asymptotic exponent in (66), as given
in (73) and (71) respectively, satisfies


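\[
1 \le \frac{E_U}{E_L}
= \frac{1}{Q_{\min}} \cdot \frac{\log e}{2 \varphi(\pi_Q)} \cdot
\frac{\log\left(1 + \frac{Q_{\min} \delta^2}{2}\right)}{\log e \cdot \frac{Q_{\min} \delta^2}{2}}
\le \frac{1}{Q_{\min}}
\tag{74}
\]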


where the upper bound in (74) follows from the fact that the second and third multiplicands in (74) are
both less than or equal to 1. Note that both bounds in (71) and (73) scale like $\delta^2$ for $\delta \approx 0$.


Appendix: A Proof of Inequality (32)

This appendix proves inequality (32), which provides upper and lower bounds on the difference
$\log\bigl(1 + \chi^2(P,Q)\bigr) - D(P \| Q)$ in terms of the dual relative entropy $D(Q \| P)$. To this end, we
first prove a new inequality relating $f$-divergences [?], and the bounds in (32) then follow as
a special case.

Recall the following definition of an $f$-divergence:

Definition 1: Let $f \colon (0, \infty) \to \mathbb{R}$ be a convex function with $f(1) = 0$, and let $P$ and $Q$ be
two probability measures defined on a common finite set $\mathcal{A}$. The $f$-divergence from $P$ to $Q$ is
defined by

\[
D_f(P \| Q) \triangleq \sum_{a \in \mathcal{A}} Q(a) \, f\Bigl(\frac{P(a)}{Q(a)}\Bigr) \tag{75}
\]

with the convention that

\[
\begin{aligned}
& 0 f\Bigl(\frac{0}{0}\Bigr) = 0, \quad f(0) = \lim_{t \to 0^+} f(t), \\
& 0 f\Bigl(\frac{b}{0}\Bigr) = \lim_{t \to 0^+} t f\Bigl(\frac{b}{t}\Bigr) = b \lim_{u \to \infty} \frac{f(u)}{u}, \quad \forall \, b > 0.
\end{aligned}
\tag{76}
\]
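Definition 1 can be illustrated by a generic routine that takes the convex function $f$ as an argument. This Python sketch assumes strictly positive distributions (so the conventions in (76) are not needed) and natural logarithms; the function names are hypothetical. It shows that $f(t) = t \log t$ yields $D(P\|Q)$, $f(t) = -\log t$ yields $D(Q\|P)$, and $f(t) = (t-1)^2$ yields the chi-square divergence.

```python
import math


def f_divergence(f, p, q):
    """D_f(P||Q) = sum_a Q(a) f(P(a)/Q(a)) for strictly positive p and q, as in (75)."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


def kl_generator(t):
    """f(t) = t log t, which gives the relative entropy D(P||Q) in nats."""
    return t * math.log(t)


def reverse_kl_generator(t):
    """f(t) = -log t, which gives the dual relative entropy D(Q||P) in nats."""
    return -math.log(t)


def chi_square_generator(t):
    """f(t) = (t - 1)^2, which gives the chi-square divergence chi^2(P,Q)."""
    return (t - 1.0) ** 2


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]
    q = [0.25, 0.25, 0.5]
    print(f_divergence(kl_generator, p, q))
    print(f_divergence(reverse_kl_generator, p, q))
    print(f_divergence(chi_square_generator, p, q))
```

Passing the generator as a function keeps the routine identical for all three divergences, which mirrors how the definition treats $f$ as a parameter.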


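Proposition 1: Let $f \colon (0, \infty) \to \mathbb{R}$ be a convex function with $f(1) = 0$, and assume that the
function $g \colon (0, \infty) \to \mathbb{R}$, defined by $g(t) = -t f(t)$ for every $t > 0$, is also convex. Let $P$ and $Q$
be two probability measures that are defined on a finite set $\mathcal{A}$, and assume that $P, Q$ are strictly
positive. Then, the following inequality holds:

\[
\min_{a \in \mathcal{A}} \frac{P(a)}{Q(a)} \cdot D_f(P \| Q)
\le -D_g(P \| Q) - f\bigl(1 + \chi^2(P,Q)\bigr)
\le \max_{a \in \mathcal{A}} \frac{P(a)}{Q(a)} \cdot D_f(P \| Q). \tag{77}
\]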

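Proof: Let $\mathcal{A} = \{a_1, \ldots, a_n\}$, and let $u = (u_1, \ldots, u_n) \in \mathbb{R}_+^n$ be an arbitrary $n$-tuple with
positive entries. Define

\[
\begin{aligned}
J_n(f, u, P) &\triangleq \sum_{i=1}^{n} P(a_i) f(u_i) - f\Bigl(\sum_{i=1}^{n} P(a_i) u_i\Bigr), \\
J_n(f, u, Q) &\triangleq \sum_{i=1}^{n} Q(a_i) f(u_i) - f\Bigl(\sum_{i=1}^{n} Q(a_i) u_i\Bigr).
\end{aligned}
\tag{78}
\]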

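The following refinement of Jensen's inequality has been introduced in [?, Theorem 1] for a
convex function $f \colon (0, \infty) \to \mathbb{R}$:

\[
\min_{i \in \{1, \ldots, n\}} \frac{P(a_i)}{Q(a_i)} \cdot J_n(f, u, Q)
\le J_n(f, u, P)
\le \max_{i \in \{1, \ldots, n\}} \frac{P(a_i)}{Q(a_i)} \cdot J_n(f, u, Q). \tag{79}
\]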


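Let $u_i \triangleq \frac{P(a_i)}{Q(a_i)}$ for $i \in \{1, \ldots, n\}$. Calculation of (78) gives that

\[
\begin{aligned}
J_n(f, u, Q)
&= \sum_{i=1}^{n} Q(a_i) f\Bigl(\frac{P(a_i)}{Q(a_i)}\Bigr) - f\Bigl(\sum_{i=1}^{n} Q(a_i) \cdot \frac{P(a_i)}{Q(a_i)}\Bigr) \\
&= \sum_{a \in \mathcal{A}} Q(a) f\Bigl(\frac{P(a)}{Q(a)}\Bigr) - f(1) \\
&= D_f(P \| Q),
\end{aligned}
\tag{80}
\]

\[
\begin{aligned}
J_n(f, u, P)
&= \sum_{i=1}^{n} P(a_i) f\Bigl(\frac{P(a_i)}{Q(a_i)}\Bigr) - f\Bigl(\sum_{i=1}^{n} \frac{P(a_i)^2}{Q(a_i)}\Bigr) \\
&\overset{(a)}{=} -\sum_{i=1}^{n} Q(a_i) \, g\Bigl(\frac{P(a_i)}{Q(a_i)}\Bigr) - f\Bigl(\sum_{i=1}^{n} \frac{P(a_i)^2}{Q(a_i)}\Bigr) \\
&\overset{(b)}{=} -D_g(P \| Q) - f\bigl(1 + \chi^2(P,Q)\bigr)
\end{aligned}
\tag{81}
\]

where equality (a) holds by the definition of $g$, and equality (b) follows from (75) and the identity
$\sum_{a \in \mathcal{A}} \frac{P(a)^2}{Q(a)} = 1 + \chi^2(P,Q)$. The substitution of (80) and (81) into (79) gives inequality (77). ■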

As a consequence of Proposition 1, we prove inequality (32). Let $f(t) = -\log(t)$ for $t > 0$.
The function $f \colon (0, \infty) \to \mathbb{R}$ is convex with $f(1) = 0$, and $g(t) = -t f(t) = t \log(t)$ for
$t > 0$ is also convex with $g(1) = 0$. Inequality (32) follows by substituting $f, g$ into (77) where
$D_f(P \| Q) = D(Q \| P)$ and $D_g(P \| Q) = D(P \| Q)$. Inequality (32) also holds in the case where
$P$ is not strictly positive on $\mathcal{A}$ with the convention in (76) where $0 \log 0 = \lim_{t \to 0^+} g(t) = 0$.
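The special case of (77) with $f(t) = -\log(t)$ reads $\min_a \frac{P(a)}{Q(a)} \, D(Q\|P) \le \log\bigl(1 + \chi^2(P,Q)\bigr) - D(P\|Q) \le \max_a \frac{P(a)}{Q(a)} \, D(Q\|P)$, and it can be checked numerically. The following Python sketch is a sanity check under stated assumptions (natural logarithms, strictly positive distributions, hypothetical function names), not a substitute for the proof.

```python
import math
import random


def relative_entropy(p, q):
    """D(P||Q) in nats for strictly positive p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def chi_square(p, q):
    """chi^2(P,Q) = sum_a (P(a) - Q(a))^2 / Q(a)."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))


def sandwich(p, q):
    """Return (lower, middle, upper) for the case f(t) = -log t of (77)."""
    ratios = [pi / qi for pi, qi in zip(p, q)]
    dual = relative_entropy(q, p)  # D(Q||P)
    middle = math.log(1.0 + chi_square(p, q)) - relative_entropy(p, q)
    return min(ratios) * dual, middle, max(ratios) * dual


def random_pmf(k, rng):
    """A strictly positive probability vector of length k (illustrative helper)."""
    w = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(w)
    return [x / total for x in w]


if __name__ == "__main__":
    rng = random.Random(1)
    p, q = random_pmf(4, rng), random_pmf(4, rng)
    lower, middle, upper = sandwich(p, q)
    print(lower, middle, upper)
```

For $P = (\tfrac12, \tfrac12)$ and $Q = (\tfrac14, \tfrac34)$, for example, the three values returned by `sandwich` are ordered as the inequality predicts.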